ABSTRACT

We consider the problem of estimating the population probability distribution
given a finite set of multivariate samples, using the maximum entropy approach.
In strict keeping with Jaynes' original definition, our precise formulation of the
problem considers contributions only from the smoothness of the estimated
distribution (as measured by its entropy) and the loss functional associated with
its goodness-of-fit to the sample data, and in particular does not make use of
any additional constraints that cannot be justified from the sample data alone.
By mapping the general multivariate problem to a tractable univariate one, we
are able to write down exact expressions for the goodness-of-fit of an arbitrary
multivariate distribution to any given set of samples using both the traditional
likelihood-based approach and a rigorous information-theoretic approach, thus
solving a long-standing problem. As a corollary we also give an exact solution
to the 'forward problem' of determining the expected distributions of samples
taken from a population with known probability distribution.

1. Introduction

According to Jaynes, the maximum entropy distribution is "uniquely determined
as the one which is maximally noncommittal with regard to missing information, in
that it agrees with what is known, but expresses maximum uncertainty with respect
to all other matters".

On the other hand, Kapur and Kesavan state that "the maximum entropy
distribution is the most unbiased distribution that agrees with given moment constraints
because any deviation from maximum entropy will imply a bias".

While the latter neatly encapsulates the modern interpretation of the maximum 
entropy principle in its application to density estimation, it is not equivalent to the 
definition given by Jaynes as it restricts its use to the case where the moments of the 
population distribution are already known. 
While this restriction may be convenient, it is not valid in any case in which one is
not simply trying to re-derive a standard distribution based upon its known moments
using maximum entropy principles. Rather, in practical applications the moments
of the population distribution are not (and indeed cannot be) known a priori, and
certainly cannot be determined on the basis of a finite number of samples.

In this paper, we give an explicit and exact expression of the maximum entropy 
density estimation problem in a form which is strictly in keeping with Jaynes' original 
(and precise) definition. 

2. Reformulating the MaxEnt Problem 

So let us return to basics and consider the problem of estimating the multivariate 
population density distribution given a finite set of samples taken at random from 
the population, assuming that the raw sample data is the only prior information we 
have. In this case, which is clearly of the most general practical applicability, the 
requirement that the maximum entropy distribution 'agrees with what is known' is 
equivalent to the requirement that the population distribution provides a good fit to 
the sample data. In this sense the maximum entropy distribution can be defined as 
"the distribution of maximum entropy subject to the provision of a good fit to the 
sample data", with the only potential uncertainty lying in the relative importance 
which should be attached to each of the two contributions. While this uncertainty
reflects the supposed ill-posedness of the density estimation problem, Jaynes' definition
implies that there should in fact exist a unique solution, so that even this uncertainty 
is in principle resolvable. While we do attempt to resolve this issue here, the matter 
certainly deserves further attention. 

The definition given in the last paragraph allows us to formulate the maximum 
entropy multivariate density estimation problem in precise mathematical terms. If 
we denote the estimated distribution by $f(r)$ where $r \in \mathbb{R}^D$, and the sample data set
by $\{x_1, \ldots, x_N\}$, we would like to maximise the functional defined by,

$$F[f(r)] = S[f(r)] + \alpha\, G[f(r), \{x_i\}] , \tag{1}$$

where $S[f(r)]$ is the normalised [a], sample-independent entropy of the estimated
distribution over its domain of definition,

$$S[f(r)] = -\frac{1}{\log V} \int f(r) \log f(r)\, dr , \quad \Big(\text{where } V = \int dr\Big) , \tag{2}$$

and $G[f(r), \{x_i\}]$ is a precise measure of the goodness-of-fit of the distribution to
the sample data. An optional tunable variable $\alpha \in [0, \infty]$ has been included which
parametrises the solutions. It is clear by inspection that $\alpha = 0$ implies the
sample-independent maximum entropy solution represented by the uniform distribution
$f(r) =$ constant, while the limit $\alpha \to \infty$ corresponds to the distribution providing the best
fit to the data without regard to its entropy. The solution for any other value of $\alpha$ will
represent some trade-off between maximising entropy and maximising the
goodness-of-fit. The fact that neither of the two extremal solutions would be of use in practical
applications does support the argument that there should exist an optimal value for
$\alpha$ (presumably unity), and hence a unique optimal density estimate. We will come
back to this point later.

[a] The normalisation factor is fixed by requiring that the entropy of the uniform distribution be unity.

3. Establishing the Goodness-of-Fit 

We have yet to give the expression for the goodness-of-fit $G[f(r), \{x_i\}]$. In the
absence of an analytically rigorous and generally applicable measure of
goodness-of-fit, various ad hoc schemes have been used in the past. As we will show, there do
exist unique analytical expressions for the goodness-of-fit of an arbitrary multivariate
probability distribution $f(r)$ to a given set of sample data $\{x_1, \ldots, x_N\}$ depending
on whether a likelihood-based or information-theoretic approach is used. While the 
former does correctly provide the likelihood of obtaining any particular set of sample 
values assuming a given population distribution, we will nevertheless demonstrate 
that it is the information-theoretic approach that is the appropriate one to use to 
find the population distribution which best accounts for the samples observed. 

3.1. Mapping Multivariate Estimation to a Univariate Problem 

It happens that there exists a well-defined procedure for mapping the complex 
multivariate problem into a tractable univariate one. To proceed, one needs to note 
that the probability of a sample taking values in a particular region of $\mathbb{R}^D$ is given by
the area (or more generally the hypervolume) under the curve $f(r)$ over that region.
Moreover we know that for a probability distribution, the total area under the curve
is normalised to unity.

The key step is to define a mapping $C_f : \mathbb{R}^D \to I$ (representing a particular kind
of cumulative probability density function corresponding to $f(r)$) from $\mathbb{R}^D$ onto the
real line segment $I = [0, 1]$ as follows,

$$C_f(x) = \int f(r)\, \Theta[f(x) - f(r)]\, dr , \tag{3}$$

where $x \in \mathbb{R}^D$ and $\Theta(y)$ is the Heaviside step function with $\Theta(y) = 1$ for $y > 0$ and
$\Theta(y) = 0$ otherwise. The mapping $C_f$ will in general be many-to-one. Its utility
lies in the fact that if we take the set of samples $\{x_1, \ldots, x_N\}$ in $\mathbb{R}^D$ and map them
to the set of points $\{C_f(x_1), \ldots, C_f(x_N)\}$ on the segment $I$ then, in view of the
equivalence between the probability and the area under the curve, the goodness-of-fit
of $f(r)$ to the samples $\{x_i\}$ is precisely equal to the goodness-of-fit of the uniform
probability distribution $g(x) = 1$ defined on the segment $I$ to the mapped samples
$\{C_f(x_i)\}$. Let us now consider the latter case in more detail.
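As an illustration of Eqn.(3), the mapping can be approximated numerically by summing the density over those grid cells whose density lies below the density at the point of interest. The following is a minimal sketch only, assuming a standard bivariate normal for $f$ and a simple midpoint grid; the names `gaussian2d`, `make_grid` and `c_f`, as well as the grid bounds and resolution, are illustrative choices and not part of the formulation above.

```python
import math

def gaussian2d(x, y):
    """Standard bivariate normal density (an illustrative choice of f)."""
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

def make_grid(lo=-5.0, hi=5.0, n=200):
    """Midpoints of an n-by-n grid covering [lo, hi]^2, plus the cell area."""
    h = (hi - lo) / n
    pts = [lo + (k + 0.5) * h for k in range(n)]
    return [(x, y) for x in pts for y in pts], h * h

def c_f(f, point, cells, cell_area):
    """Approximate C_f(point): probability mass lying where f(r) < f(point)."""
    level = f(*point)
    # Theta[f(x) - f(r)] is 1 only where the argument is strictly positive.
    return sum(f(*r) * cell_area for r in cells if level - f(*r) > 0.0)

cells, area = make_grid()
value = c_f(gaussian2d, (1.0, 0.0), cells, area)
# For this radially symmetric f the exact value is exp(-|x|^2 / 2).
print(value, math.exp(-0.5))
```

For this radially symmetric example the region of lower density than at $x$ is the exterior of the circle through $x$, so the grid estimate can be compared against $\exp(-|x|^2/2)$.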

3.2. Uniformly Distributed Samples on a Real Line Segment 

Consider a perfect random number generator which generates values uniformly 
distributed in the range [0, 1]. Suppose we plan to use it to generate $N$ random
samples. We can calculate in advance the probability distribution $p_{N,i}(x)$ of the i-th
sample (where the samples are labelled in order of increasing magnitude), as follows.

Let $X_i$ be the random variable corresponding to the value of the i-th sample for
each $i = 1 \ldots N$. Note that the probability of a number (selected at random from
[0, 1] assuming a uniform distribution) being less than some value $x \in [0, 1]$ is simply
$x$, while the probability of it being greater than $x$ is $1 - x$. Thus, if we consider
the i-th value in a set of $N$ samples taken at random, the probability density for $X_i$ to take
the value $x$ is given by the product of the probability $x^{i-1}$ that $i - 1$ of the values
are less than $x$ and the probability $(1 - x)^{N-i}$ that the remaining $N - i$ values are
greater than $x$, multiplied by a combinatorial factor $Z_{N,i}^{-1}$ counting the number of ways
$N$ integers can be partitioned into three sets of size $i - 1$, 1 and $N - i$ respectively,

$$p_{N,i}(x) = P(X_i = x) = Z_{N,i}^{-1}\, x^{i-1} (1 - x)^{N-i} . \tag{4}$$

From simple combinatorics, the value of $Z_{N,i}^{-1}$ is given by,

$$Z_{N,i}^{-1} = \frac{N!}{(i-1)!\,(N-i)!} = \frac{\Gamma(N+1)}{\Gamma(i)\,\Gamma(N-i+1)} = B(i, N-i+1)^{-1} , \tag{5}$$

where $B(p, q)$ is the Euler beta function which appears in the Veneziano amplitude for
string scattering. That this value is correct can be checked using the fact that $p_{N,i}(x)$
must be normalised so that $\int_0^1 p_{N,i}(x)\, dx = 1$, and noting that the resulting integral is
just the definition of the beta function given above. Note also that if experiments are
carried out in which sets of $N$ samples are taken repeatedly, the expectation of the
i-th sample is given by,

$$E[X_i] = \int_0^1 x\, p_{N,i}(x)\, dx = \frac{i}{N+1} , \tag{6}$$

for $i = 1 \ldots N$, corresponding to the most regularly distributed configuration of the
$N$ samples possible, and in excellent accord with intuition.
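The normalisation of Eqn.(4) and the expectation of Eqn.(6) can be checked numerically. The sketch below assumes the factorial form of the combinatorial factor and uses a plain midpoint rule; the helper names `p_order` and `integrate` and the choice $N = 5$ are illustrative only.

```python
import math

def p_order(n, i, x):
    """Density p_{N,i}(x) of the i-th smallest of n uniform samples on [0, 1]."""
    inv_z = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    return inv_z * x ** (i - 1) * (1.0 - x) ** (n - i)

def integrate(g, steps=20000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / steps
    return h * sum(g((k + 0.5) * h) for k in range(steps))

n = 5
for i in range(1, n + 1):
    norm = integrate(lambda x: p_order(n, i, x))        # should be close to 1
    mean = integrate(lambda x: x * p_order(n, i, x))    # should be close to i/(n+1)
    print(i, round(norm, 6), round(mean, 6), i / (n + 1))
```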

3.3. The Maximum Likelihood Approach 

Taking a traditional likelihood-based approach, an expression for the
goodness-of-fit of a set of $N$ samples to the uniform distribution on [0, 1] can now be obtained
by first labelling the samples in order of increasing magnitude and then calculating
the likelihood given by,

$$L[\{x_i\}] = \prod_{i=1}^{N} p_{N,i}(x_i) . \tag{7}$$

Bearing in mind the mapping $C_f : \mathbb{R}^D \to I$ defined in (3), we can generalise
the above to derive an exact expression for the goodness-of-fit of a set of $N$ samples
$\{x_1, \ldots, x_N\}$ to an arbitrary multivariate probability distribution $f(r)$,

$$L[f(r), \{x_i\}] = L[\{C_f(x_i)\}] = \prod_{i=1}^{N} p_{N,i}(C_f(x_i)) , \tag{8}$$

where the samples are now labelled in order of increasing magnitude of $f(x_i)$ and
hence $C_f(x_i)$.

Let us take a closer look at the likelihood measure of Eqn.(7) and in particular, 
let us consider the simple illustrative case when only two samples are taken from 
the uniform distribution on [0, 1]. In this case, the likelihood is maximised if the
samples happen to take precisely the values 0 and 1. This slightly perturbing result
is actually correct and is one of the reasons why care must be taken if one wishes to 
apply likelihood-based arguments in the opposite direction to estimate the population 
distribution based upon observed sample data. More generally, the expression (8) for 
the likelihood will always be biased towards the case where the position of the first 
sample coincides with the minimum value of the probability distribution and that of 
the last sample with its maximum. 
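The two-sample case can be verified directly. The sketch below assumes that the likelihood of Eqn.(7) is the product of the order-statistic densities of Eqn.(4), so that for $N = 2$ it reduces to $2(1 - x_1) \cdot 2 x_2$; the function name `likelihood_two` and the grid search are illustrative only.

```python
def likelihood_two(x1, x2):
    """Product p_{2,1}(x1) * p_{2,2}(x2) for ordered samples x1 <= x2."""
    return (2.0 * (1.0 - x1)) * (2.0 * x2)

grid = [k / 100 for k in range(101)]
best = max(((x1, x2) for x1 in grid for x2 in grid if x1 <= x2),
           key=lambda pair: likelihood_two(*pair))
print(best, likelihood_two(*best))     # maximiser found on the grid
print(likelihood_two(1 / 3, 2 / 3))    # evenly spaced samples, for comparison
```

On this grid the maximiser sits at the end points of the segment, whereas the evenly spaced configuration of Eqn.(6) scores strictly lower.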

These considerations are sufficient to show that the maximum likelihood approach 
to multivariate density estimation is problematic and provide us with good reason to 
seek an alternative, more rigorous approach. 

3.4. The Information Theoretic Approach

The rigorous alternative lies in taking an information theoretic approach. In- 
deed we will show that it is possible to assign a unique entropy associated with the 
goodness-of-fit of the estimated population distribution to the sample data in the 
same way (see Eqn.(2)) that an entropy was assigned to the estimated distribution 
itself. 

To see how, consider the values $y_i = C_f(x_i)$ of the samples obtained after having
mapped them to the real segment using the mapping defined in Eqn.(3). Defining $y_0 = 0$
and $y_{N+1} = 1$ for convenience, these values are constrained by the 'normalisation'
property,

$$\sum_{i=1}^{N+1} (y_i - y_{i-1}) = 1 . \tag{9}$$

Then by considering each of the $N+1$ gaps between the values as 'sample bins', and
the size $d_i = y_i - y_{i-1}$ of each gap as the probability associated with the corresponding
bin, it becomes possible to identify the distribution of the mapped samples on [0, 1]
with a discrete probability distribution defined over the set of $N + 1$ sample bins. The
(normalised) entropy associated with the fit of the estimated population distribution
to the sample data can then be equated with the entropy of the equivalent discrete
probability distribution,

$$S'[f(r), \{x_i\}] = -\frac{1}{\log(N+1)} \sum_{i=1}^{N+1} d_i \log d_i . \tag{10}$$

This is just the discrete version of the expression given in Eqn.(2) for a continuous
probability distribution. Maximising Eqn.(10) for the entropy results immediately in
the desirable property that the samples are equally spaced, namely $y_i = i/(N + 1)$,
in agreement with the expected values obtained less directly in Eqn.(6).
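The entropy of Eqn.(10) is straightforward to evaluate once the samples have been mapped to [0, 1]. The following is a minimal sketch, assuming the usual convention that $0 \log 0 = 0$ for coincident values; the function name `gap_entropy` and the example values are illustrative.

```python
import math

def gap_entropy(mapped):
    """Normalised entropy S' of Eqn.(10) for mapped sample values in [0, 1]."""
    ys = [0.0] + sorted(mapped) + [1.0]
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    total = sum(d * math.log(d) for d in gaps if d > 0.0)  # 0 log 0 taken as 0
    return -total / math.log(len(gaps))

n = 4
even = [i / (n + 1) for i in range(1, n + 1)]
print(gap_entropy(even))                    # equally spaced values
print(gap_entropy([0.1, 0.15, 0.2, 0.9]))   # clustered values
```

Equally spaced values give the largest possible value of the normalised entropy, and any clustering of the mapped samples lowers it.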

The discussion above strongly suggests that the entropy $S'$ of Eqn.(10) should
be used instead of the traditional likelihood (as given here by Eqn.(8)), both as a
measure of the goodness-of-fit of an arbitrary population distribution to a given set
of multivariate samples, and also as the second term $G[f(r), \{x_i\}]$ appearing in the
functional of Eqn.(1).

Substituting (10) into (1), we claim that the rigorous solution to the MaxEnt
multivariate density estimation problem is given by the probability distribution which
maximises the functional,

$$F[f(r)] = -\frac{1}{\log V} \int f(r) \log f(r)\, dr - \frac{\alpha}{\log(N+1)} \sum_{i=1}^{N+1} d_i \log d_i , \tag{11}$$

where,

$$d_i = C_f(x_i) - C_f(x_{i-1}) , \tag{12}$$

and the mapping $C_f$ is given by Eqn.(3), with the end values 0 and 1 standing in for
$C_f(x_0)$ and $C_f(x_{N+1})$ respectively. In (11) the parameter $\alpha \in [0, \infty]$ can
be used to tune the solutions, bearing in mind that a smaller value will emphasise
the smoothness of the resulting distribution, while a larger value will emphasise the
goodness of fit. Setting $\alpha$ to unity and maximising will in principle give the unique [b]
maximum entropy distribution as originally envisioned by Jaynes [c].

The information-theoretic approach we have described in this section is more
rigorous and compelling than the traditional likelihood approach, as clearly evidenced
by the pleasingly symmetric form (11) of the resulting optimisation problem. In
particular, both terms contributing to the functional are associated with entropies:
the first term being the entropy associated with the smoothness of the estimated
population distribution, and the second term being the entropy associated with the
goodness-of-fit of the distribution to the sample data.

[b] Note that there is some potential for non-uniqueness to creep in due to the possibility of degenerate
solutions in certain situations such as when $N$ is very small or when the samples are distributed
highly symmetrically.

[c] The fact that the entropy of a uniform distribution over $[-\infty, \infty]$ is infinite ($\sim \log V$), while being
finite for other distributions ($\sim \log \sigma\sqrt{2\pi e}$ for a univariate Gaussian), suggests that special attention
may be required in the non-compact case to prevent one term overwhelming the other. A simple
way of regulating the problem is to consider non-compact domains as extremal limits of compact
ones where the distributions are constrained to be smooth and to vanish at the boundaries.

Algorithms implementing the optimisation procedure are under development, which 
will allow us to calculate specific solutions of (11) and to perform more detailed in- 
vestigations of their properties. We hope to present these results in a future paper. 
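Short of a full optimisation algorithm, the functional can at least be evaluated for a given candidate density. The sketch below is not the optimisation procedure referred to above; it is a hypothetical grid-based evaluation for a univariate density on a compact interval, combining the entropy of Eqn.(2), the mapping of Eqn.(3) and the gap entropy of Eqn.(10). The names `functional` and `gaussian`, the grid size, the interval and the sample values are all assumptions made for illustration.

```python
import math

def functional(f, samples, lo, hi, alpha=1.0, cells=2000):
    """Grid estimate of S[f] + alpha * S'[f, samples] on [lo, hi].

    Assumes hi - lo != 1 so that log V is non-zero.
    """
    h = (hi - lo) / cells
    mids = [lo + (k + 0.5) * h for k in range(cells)]
    vals = [f(r) for r in mids]
    mass = sum(vals) * h
    vals = [v / mass for v in vals]            # renormalise on the grid
    # Normalised entropy of the density, with V = hi - lo.
    s = -sum(v * math.log(v) for v in vals if v > 0.0) * h / math.log(hi - lo)
    # Map each sample to the mass of the region where the density is lower.
    levels = [f(x) / mass for x in samples]
    ys = sorted(sum(v * h for v in vals if lev - v > 0.0) for lev in levels)
    ys = [0.0] + ys + [1.0]
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    s_fit = -sum(d * math.log(d) for d in gaps if d > 0.0) / math.log(len(gaps))
    return s + alpha * s_fit

def gaussian(mu, sigma):
    """Unnormalised Gaussian profile; functional() renormalises it on the grid."""
    return lambda r: math.exp(-0.5 * ((r - mu) / sigma) ** 2)

samples = [1.2, 1.9, 2.1, 2.6, 3.3]
for sigma in (0.3, 0.7, 1.5, 5.0):
    print(sigma, functional(gaussian(2.2, sigma), samples, 0.0, 4.0))
```

Scanning a family of candidates in this way gives only a crude picture of the trade-off between the two terms; a perfectly flat candidate maps every sample to the same point and therefore receives no contribution from the second term.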

3.5. A Corollary: The Forward Problem 

Before ending, it is worth mentioning here as a corollary that the distributions
$p_{N,i}(x)$ of (4) also help us to solve the 'forward problem', i.e. that of determining
the expected distributions $P_{N,i}(r)$ of any set of $N$ samples taken at random from a
multivariate population where the population density distribution $f(r)$ is given.

3.5.1. The univariate case 

We would like to know the expected distribution of the samples when the
univariate population distribution $f(x)$ is given. In place of the mapping of (3), it is
appropriate here to consider the mapping $C'_f$ defined by the cumulative probability
density function,

$$C'_f(x) = \int_{-\infty}^{x} f(y)\, dy . \tag{13}$$

If the univariate samples are labelled in increasing order of value then their expected
distributions are given by,

$$P_{N,i}(x) = p_{N,i}(C'_f(x)) , \tag{14}$$

and these can be used for example to estimate the experimental errors in individual
sample values given an estimate of the population distribution.
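A simple simulation illustrates the univariate forward problem: after mapping the ordered samples through the cumulative distribution, their averages should approach the values $i/(N+1)$ of Eqn.(6). The sketch below assumes a unit-rate exponential population, for which the cumulative function is $1 - e^{-x}$; the function name `mapped_order_means`, the seed and the number of trials are illustrative choices.

```python
import math
import random

def mapped_order_means(n, trials=20000, seed=0):
    """Average C'_f(X_i) over repeated draws of n samples from Exp(1)."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(trials):
        xs = sorted(rng.expovariate(1.0) for _ in range(n))
        for i, x in enumerate(xs):
            sums[i] += 1.0 - math.exp(-x)   # C'_f(x) for f(y) = exp(-y), y >= 0
    return [s / trials for s in sums]

means = mapped_order_means(4)
print(means)                           # simulated averages
print([i / 5 for i in range(1, 5)])    # i / (N + 1) for N = 4
```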

3.5.2. The multivariate case 

The forward problem does not have an obvious generalisation to the multivariate 
case because of the lack of an unambiguous definition of the cumulative probability 
density function in that case. Nevertheless we can instead apply the mapping $C_f$ of
Eqn.(3) (paying careful attention to the degeneracies present) to obtain the following 
expected distributions for the samples ordered as described below Eqn.(8), 



where $J(r)$ measures the (typically $(D-1)$-dimensional) volume of the degeneracy of
$f(r)$ (i.e. the volume of the subspace of $\mathbb{R}^D$ sharing the same value of $f(r)$) for each
value of $r$. At special values the region of degeneracy may have dimensionality less
than $(D-1)$ in which case the value of $P_{N,i}(r)$ becomes irrelevant and can safely be
ignored. On the other hand for distributions which contain $D$-dimensional subspaces
throughout which $f(r)$ is constant (the uniform distribution being an obvious
example), special considerations will be required in order to generalise the analysis
leading to Eqn.(4) for the real line segment to irregular, multidimensional, and
possibly non-compact spaces. Excepting the simplest cases, such an analysis promises
to be highly non-trivial and we will not attempt to delve into such intricacies here.
Note that (15) does not agree with (14) in the univariate case as the labelling of the
samples and the corresponding interpretations of the distributions are quite different
in each case.

It is often assumed that the deviations of individual observations from their
expected values follow a normal distribution [d] for univariate data, leading to a $\chi^2$
measure of goodness-of-fit. Our exact results in Eqn.(14) demonstrate that this
approximation only holds if $N$ is sufficiently large and only then if $f(r)$ is sufficiently
well-behaved. We will conclude our analysis at this point.

4. Summary and Discussion 

The purpose of the present work has been to reformulate the maximum entropy 
(MaxEnt) density estimation problem in a precise way which is in strict keeping 
with its original definition as introduced by Jaynes. The importance of having such 
a precise formulation hardly needs mentioning given the ubiquity of the estimation 
problem throughout the sciences. 

In reaching our formulation we have managed to solve the long-standing problem 
of obtaining an exact expression for the likelihood of observing any particular set of 
sample values when taken at random from a given population. This is useful in the 
experimental sciences for validating theoretical models on the basis of observations. 
As a corollary, we have also been able to propose the solution to the 'forward problem' 
- that of determining the distribution of the samples when the population distribution 
is known. 

The traditional maximum likelihood approach was shown to have some unsatisfac- 
tory features when applied to the problem of density estimation. On the other hand, 
by taking a novel information-theoretic approach we have succeeded in deriving an 
explicit and rigorous entropic measure of the goodness-of-fit of a generic population 
distribution to a given set of multivariate samples. This in turn has made it
possible to reformulate the MaxEnt density estimation problem in a unique, precise and
purely information-theoretic way.

[d] A discussion of the trade-off between smoothness and goodness-of-fit in the context of the
assumption of normally distributed deviations appears in Gull (1989).

We have made allowance for the introduction of an optional tunable parameter 
in our expression of the MaxEnt problem which parametrises solutions ranging from 
those with maximal smoothness to those providing maximal fit to the data. The 
effect of this parameter on the solutions has not been discussed in detail here, and 
we intend to come back to it in future once computational algorithms implementing 
the optimisation have been developed. 